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Introduction 


A.  Implicitness 

The  most  common  application  envisioned  in  the  design  of  vector 
processors  has  been  the  solution  of  partial  differential  equations. 

It  is  now  clear  that  many  application-oriented  researchers  are 
planning  to  use  this  vastly  increased  computational  capability  to 
make  solution  algorithms  more  implicit  [1] . For  example,  an  algorithm 
implicit  by  line  (i.e.,  in  one  dimension)  would  become  implicit  by 
strips  (1+rt  dimensions)  ; or,  more  variables  may  be  coupled  in  a multi- 
variable  problem.  It  is  usually  found  that  the  larger  the  problem 
and/or  the  more  implicit  the  algorithm,  the  greater  fraction  of  total 
formulation  and  solution  time  is  devoted  to  the  latter.  The  coding 
of  the  equation- solver  then  becomes  a critical  issue. 


B.  Vector  Length 

The  vector  length  is  the  most  obvious  and  general  concern  in 
vector  processing.  The  length  results  either  from  simultaneous  oper- 
ations on  a number  of  similar  systems  or  from  the  density  of  structure 
(coupling  of  variables  and  grid  nodes)  within  a single  system. 

In  the  solution  of  partial  differential  equations,  the  frequency 
of  variable  updating  determines  the  number  of  grid  node  equations  that 
can  be  formulated  simultaneously . A point-Jacobi  iteration  would 
allow  simultaneous  formulation  and  updating  of  all  variables.  An  ADI 
method  would  allow  simultaneous  updating  of  variables  along  a line. 

The  coupling  between  grid  points  and  between  lines  of  grid  points 
assumed  in  the  equation  solution  determines  the  number  of  systems  of 
equations  which  can  be  solved  simultaneously . Thus,  if  grid  points 
are  assumed  coupled  in  only  one  of  two  dimensions,  then  one  can  expect 
a system  of  tridiagonal  or  block  tridiagonal  matrices  which  can  be 
solved  simultaneously.  This  in  turn  yields  a vector  length  equal  to 
the  number  of  uncoupled  grid  lines.  On  the  other  hand,  an  ADI  method 
necessitates  a solution  of  a single  tridiagonal  system. 

The  most  obviously  vectorizable  algorithm  would  therefore  be 
one  in  which  variables  are  simultaneously  updated  and  minimal  coupling 
is  assumed  between  grid  nodes  and/or  variables  at  a grid  point. 
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Iterative  methods  based  on  such  schemes  have  notoriously  poor  conver- 
gence. However,  if  simultaneity — and  hence  vector  length — is  reduced 
to  increase  convergence  rate,  one  must  exploit  single  system  density 
to  achieve  long  vectors.  This  density  is  usually  manifested  by  rela- 
tively small  block  sizes,  bandwidths,  or  profiles  in  the  matrix 
structure. 

C.  Data  Flow 

With  attention  only  to  the  vector  length  and  with  the  use  of  a 
high  level  language  such  as  Fortran,  it  is  not  uncommon  to  obtain 
only  20%-50%  of  the  optimum  CRAY-1  performance  in  the  equation  solu- 
tion. The  data  flow  must  also  be  considered  to  achieve  high  per- 
formance, as  illustrated  by  the  following  instances  associated  with 
equation-solving  codes. 

(a)  With  short  vector  chained  operations,  the  CRAY-1  protocol 
results  in  large  bubbles  in  the  arithmetic  pipelines  [2] . This  occurs 
in  the  processing  of  small  dense  blocks  or  narrow  bandwidths. 

(b)  Vectors  of  any  length  on  which  little  computation  is  per- 
formed can  create  excessive  data  flow  between  memory  hierarchies. 

This  situation  prevails  in  the  above-mentioned  simultaneous  solution 
of  equations,  and  results  from  the  inherent  decoupling  of  such  sys- 
tems. 

(c)  General  equation-solving  codes — ones  which  can  accomodate 
arbitrary  problem  size  parameters — may  suffer  from  excessive  data  flow 
visa  vis  a special  code  written  for  small  problem  sizes  which  can 
maintain  critical  data  in  the  cache  memory.  For  example,  a simultan- 
eous block  tridiagonal  solver  specialized  to  a fixed  small  block 

size  can  yield  much  higher  execution  rates  than  a general  block  tri- 
diagonal solver  (to  be  demonstrated). 

The  goal  of  high  performance  in  spite  of  short  vectors  and 
apparent  data  flow  bottlenecks  appears  to  suggest  the  need  for  a 
plethora  of  specialized  and  highly-tuned  codes  to  fill  the  same  func- 
tions as  a single  code  executing  on  a scalar  processor.  However,  a 
mitigating  effect  is  the  computational  dominance  of  the  equation 
formulation  over  the  equation  solution  as  the  coupling  shrinks.  This 
observation  is  based  on  (a)  the  number  of  entries  in  a matrix  being 
of  complexity  0(nr),  and  (b)  the  triangular  factorization  operation 
count  being  0(n  ),  where  €>0  and  n represents  the  number  of  vari- 
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ables  coupled  to  each  other.  Thus,  more  inefficiency  can  be  tolerated 
• in  the  equation  solver  with  small  blocks,  bands,  and  profiles. 

D.  Report  Summary 

This  report  provides  user  documentation  for  three  classes  of  codes 
either  expected  to  be  of  general  utility  or  else  resulting  from  on- 
going specialized  algorithm  research  for  the  CRAY-1.  All  were  devel- 
oped with  aid  of  a CRAY-1  timing  simulator  [3].  Many  of  the  accumula- 
tion kernels  on  which  the  high  performance  of  the  codes  depend  are 
described  in  [4]. 

(a)  Small  dense  systems.  Highly-tuned  accumulation  kernels 
yield  codes  which  achieve  high  execution  rates  with  small  full  and 
banded  systems. 

(b)  Simultaneous  systems.  Simultaneous  full,  banded,  and  block 
tridiagonal  equation  solvers  have  been  developed  around  kernels  which 

. reduce  the  previously-mentioned  memory  traffic.  A variety  of  block 

tridiagonal  solvers  are  included  representing  the  utility  of  such 
codes. 

« 

(c)  Single  system,  single  variable,  odd-even  tridiagonal 
solver.  The  use  of  a simulator  was  deemed  essential  to  develop  this 
challenging  code  which  involves  high  memory  traffic. 

A fourth  code,  which  can  solve  general  single  system  sparse  pro- 
blems ranging  fran  block  tridiagonal  systems,  general  full  and  banded 
matrices  (which  must  be  partitioned  into  64  x 64  blocks  for  the  CRAY-1) , 
and  arbitrary-sparsity  finite  element  problems,  is  also  being  prepared 


[5]  . 
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II.  Codes  for  the  solution  of  small,  dense  systems 

A.  Solution  of  a single  full  unsymmetric  system  of  equations  (Calahan) 
Description. 

A system  of  equations  A X = B is  solved  for  X/  where 
A is  an  n x n real  matrix  and  X and  B are  n x m real 
matrices.  Double  column  accumulation  f4]  is  used  to 
achieve  full  cache  utilization  during  both  the  triang- 
ular factorization  of  A and  the  forward  and  back  sub- 
stitution of  B.  When  m = 1,  an  alternate  substitution 
routine  is  provided. 


Subroutine  call  to  triangularly  factor  A. 

CALL  FULFAC  (N,  A,  NDIMA,  IERR) 
where  N is  the  dimension  of  A 
A is  the  array  representing  A 
NDIMA  is  the  row  dimension  of  array  A 
IERR  contains  a return  code: 

IERR  = 0 implies  N = 0 
IERR  > 0 is  normal  exit 

IERR  < 0 implies  zero-valued  pivot;  position  given 
by  | IERR | . 

Subroutine  call  to  forward  and  back  substitute 
CALL  FULSOL  (N,  A,  NDIMA,  B,  M,  NDIMB) 
where  A contains  the  factored  matrix 
B is  the  array  representing  B 
M is  the  number  of  columns  of  B 
NDIMB  is  the  row  dimension  of  array  B 

Restrictions . 

(a)  N £ 64. 

(b)  no  pivoting 

(c)  Fetches  (but  not  stores)  from  main  memory  will  occur 
in  FULFAC  and  FULSOL  from  the  (n+l)st  and  (m+l)st 
columns  of  A and  B;  this  space  should  contain  data, 
not  instructions. 
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When  M = 1,  a special  call* 

CALL  FULS0L1  (N,  A,  NO IMA/  D,  M,  DNIMB) 
should  be  used  for  best  efficiency.  The  A array  is  not  altered 
in  either  FULSOL  OR  FULS0L1. 


Performance  (simulated) 

Matrix 

Factorization 

Substitution 

size 

4 

6. 5/. 47 

4. 5/. 49 

8 

23/1.1 

12/. 77 

16 

58/3.6 

27/1.5  (52/1.5) 

32 

95/18 

44/3.7 

64 

122/113 

60/11 

[Execution  rate  (MFLOPS) ] / [time  (kilo  clocks)] 
for  solution  of  a full  system  of  equations. 
Result  in  parentheses  for  substitution  of  two 
columns  of  B (M  = 2) . 


i 


j 


. *^ULS0L1  will  solve  systems  where  M > 1 and  may  be  useful  when  M is  odd 
and  the  extra  column  fetch  of  FULSOL  is  undesirable. 
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B.  Solution  of  a single  banded  unsymmetric  system  of  equations 
(Calahan) * 

Description 

A system  of  equations  A x = b is  solved  for  x,  where  A is 
a banded  real  matrix  and  x and  b are  n x 1 real  vectors. 
The  matrix  A is  stored  in  compressed  form. 


Subroutine  call  to  triangularly  factor  A. 

CALL  BANFAC  (N,  NB,  A,  NDIM) 

where  N is  the  dimension  of  A 

NB  is  the  half  bandwidth  t'i.e.,  2*NB+1  is  the  full  bandwidth.) 
A is  an  array  containing  the  elements  of  A 
NDIM  is  the  row  dimension  of  A. 

Subroutine  call  to  forward  and  back  substitute. 

CALL  BANSOL  (N,  NB,  A,  NDIM,  B) 

where  B contains  the  elements  of  b on  entry  and  x on  exit. 

Comments 

A is  stored  in  packed  form  so  that  the  (i,j)  position  of  A 
is  stored  in  the  (i  - j + NB)  + (j  - 1)*NB  address  of  A. 


i 

I 


•The  factorization  algorithm  is  a recoded  version  of  a band  solver 
written  by  T.  Jordan  of  LASL  [ 6 ] [ 7 j . The  substitution  code  is 
identical  to  Jordan's. 
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Half 

Bandwidth 

Factorization 

Substitution 

2 

3.8/14 

6. 2/8.1 

4 

9.6/19 

11/8.1 

8 

23/28 

20/8.1 

16 

52/45 

36/8.1 

32 

88/210 

65/18 

64 

117/1260 

93/49 

[Execution  rate  (MFLOPS) ]/ [time  (kiloclocks) ] for 
solution  of  a banded  system  of  equations.  Sixty 
four  equations  were  solved  except  for  half  band- 
widths  of  32  and  64,  where  128  and  256  equations 
were  solved,  respectively. 


Codes  for  the  Solution  of  Simultaneous  Systems 


Introduction 

A A A 


Let  » x = b represent  the  simultaneous  system  of  equations 
described  by  the  block-diagonal  matrix 


x'*"  bv*' 

„ (3)  = w(3) 


•A(m)  i<m) 


where  the  a'  ' are  identically-structured  n x n real  unsymmetric 

(k)  (k)  (k) 

matrices  containing  a^j  and  the  x and  b are  n x 1 real 

He)  fk) 

matrices  containing  x.>  and  b^  ' , respectively. 

Solutions  of  such  systems  on  a vector  machine  have  a number 
of  common  characteristics. 

1.  Vectors  are  defined  across  the  systems;  e.q.,  the  (i.i) 

(k) 

positions  of  all  A , k = 1,  2...m,  constitute  a single  vector. 

2.  Two  storage  array  maps  are  common. 

A A A 

Map  I.  A,  x,  and  b stored  by  column,  then  by  row,  then 
by  system,  i.e.. 


OO. 

j 

i = 1,2, . .n; 

j = 

I,  2,..: 

(k)  . 

i = 1 , 2 , . . n ; 

k = 

1,2,  • • in 

(k)  . 

l 1,2, *.n; 

k = 

1,2,  • *111 

Map  II.  A,  x,  and  b stored  by  system,  then  by  column, 
then  by  row. 


2,  . . 


I 


Map  I may  suffer  from  bank  conflicts  and  loss  of  critical 
chaining  when  the  systems  are  a multiple  of  8 address  locations 
apart. 

A number  of  simultaneous  system  solvers  are  described  in  the 
report. 

1.  Simultaneous  full  systems. 

2.  Simultaneous  banded  systems. 

3.  Simultaneous  3x3  block  tridiagonal  systems. 

4.  Simultaneous  5x5  block  tridiagonal  systems. 


r 
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B.  Solution  of  full  unsymmetric  simultaneous  systems  of  equations 
(Ames/Calahan) 

Description 

m simultaneous  systems  of  the  form  of  (1)  are  solved  when 
(k) 

the  A'  are  full  matrices. 


Subroutine  call  to  triangularly  factor  A. 

CALL  FUSFAC  (A,  N,  M,  IA) 

where  A is  the  array  containing  the  full  A*k*  matrices  of 

N is  the  dimension  of  the  A^ 

(kT 

M is  the  number  (m)  of  A'  ' matrices 


(k) 


IA  is  the  address  displacement  in  array  A between  a:. 

(k+1)  *3 


and  a 


il 


(1) 


Subroutine  call  to  forward  and  back  substitute 
CALL  FUSSOL  (B,  IB) 

where  B is  the  array  containing  the  right  hand  sides  b and 
the  solutions  x, 

IB  is  the  address  displacement  in  array  B between  bJk^  and 
(k+1)  1 

i 

Restrictions: 

(a)  M s 64 

2 

(b)  IA  2 N , i.e.,  Map  I is  used. 

(c)  no  err  monitoring  of  pivot  reciprocation. 

Performance  (simulated) 

Equations  Number  of  Systems  (M) 

per 

system  (N) 4 8 16 32 64 

2 3.1/. 52  5. 8/. 55  11/. 61  16/. 80  21/1.2 

4 5. 7/2.1  11/2.2  14/2*6  28/3.5  35/5.6 

8 9.6/11  18/11  32/13  45/18  53/31 
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C.  Solution  of  banded  unsymmetric  systems  of  equations  (Ames/ 
Calahan) 

Description: 

m simultaneous  systems  of  the  form  of  (1)  are  solved  when 


(k) 


the  A1  ' are  banded.  The  are  stored  in  compressed  form. 

A 

Subroutine  call  to  triangularly  factor  A and  forward  and  back 
substitute 

CALL  BANSIM  (A,  B,  N,  NB,  M) 

where  A is  an  M*(2*NB+1)*N  array  of  elements  of  the  banded 
A^  matrices  of  (1) 

(k) 


B is  an  M*N  array  of  elements  of  the  right  hand  side  b( 
and  the  solution 

N is  dimension  of  the  matrices  A 

NB  is  the  half  bandwidth  (i.e.,  2*NB+1  is  the  full  bandwidth) 
M is  the  number  of  systems. 


Comments 


A,  a band  matrix,  is  stored  in  packed  form  so  that  the  (i,j) 
(k) 

position  of  A is  stored  in  the  k + (i  - j + NB)*M  + 

(j  - 1) *M* (2*NB+1)  address  of  A. 

A 

b is  stored  so  that  the  ith  position  of  b 
k + (i  - 1)*M  address  of  B 


(k) 


is  stored  in  the 
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Performance  (simulated) 


Half- 

bandwidth 

(NB) 


2 


8 


Number  of  systems  (M) 


2 

4 

8 

16 

32 

64 

2.1/47.3 

4.1/47.8 

7.9/49.3 

14.5/53.9 

23./68.0 

31.3/100. 

3.1/80.7 

6.2/81.5 

12./84.0 

21.6/93.6 

32.9/123. 

43.0/188. 

4.6/156. 

9.2/157. 

17.9/161. 

31.1/186. 

45.1/256. 

56.5/409. 

[Execution  rates  (MFLOPS) ] / [ time  (kiloclocks) ] 
for  factorization,  forward,  and  back  substitution 
of  simultaneous  banded  systems.  Each  system  has 
32  equations. 


. HUi 
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D.  Simultaneous  block  tridiagonal  systems  (Sesek) 

Descriptions 

A /\ 

Consider  the  simultaneous  block  tridiagonal  system  A x = 
with  kth  system  A(k)  x(k)  = b(k) , viz 


(k) 

where  the  size  of  the  square  A?.'  is  of  fixed  size 
to  achieve  high  execution  rates. 

A 

Subroutine  call  to  triangularly  factor  A. 

3x3  block  size 

CALL  BT3FAC  (A,  B,  C,  M,  N) 

5x5  block  size 

CALL  BT5FAC  (A,  B,  C,  M,  N) 
where  A,  B,  and  C are  defined  below 
M is  the  number  of  systems 
N is  the  number  of  diagonal  blocks 

Subroutine  call  to  forward  and  back  substitution 

3x3  block  size 

CALL  BT3S0L  (A,  B,  C,  D,  M,  N) 


5x5  block  size 

CALL  BT5S0L  (A,  B,  C,  D,  M,  N) 


< £»l 


Comments 

A A A 

Storage  Map  II  is  used  for  all  A^,  x^,  and  b^.  For  block 
size  l,  the  storage  is  given  below  for  n diagonal  blocks. 


x 1 on  exit  from  code 


Performance  (simulated) 


Execution  rates  (MFLOPS) 


Number 

of 

Systems 


MFLOP  rates  for  solution  of  block  tridiagonal  systems 
for  two  block  sizes,  as  a function  of  the  number 
of  simultaneous  systems  ( = vector  length) . 


3x3 

3x3 

5x5 

5x5 

FACTOR. 

SUBS. 

FACTOR. 

SUBS. 

28.8 

27.6 

40.8 

40.1 

46.6 

46.1 

54.7 

53.6 

60.3 

58.7 

65.5 

64.1 

67.9 

66.3 

72.0 

70.0 

Code  for  the  Solution  of  a Single  Tridiagonal  System  (CalahaiySesek) 

Description 

A single  tridiagonal  system  is  completely  solved  by 
cyclic  (odd-even)  reduction.  This  method  requires 
-2.1  times  the  floating  point  computation  of  a scalar 
solution  (important  when  evaluating  the  MFLOPS  rates 
below),  but  yields,  for  large  matrices,  nearly  all 
64-length  vector  operations. 

Subroutine  call  to  triangularly  factor  A. 

CALL  TRIFAC  (A,  B,  C,  N) 

where  A contains  the  diagonal  stripe 
B contains  the  super-diagonal  stripe 
C contains  the  sub-diagonal  stripe 
N is  the  number  of  diagonal  elements 

Subroutine  call  to  forward  and  back  substitute. 

CALL  TRISOL  (A,  B,  C,  D,  N) 

where  D contains  the  right  hand  side  on  entry  and 
the  solution  on  exit. 

Performance  (simulated) 


Matrix 
size  (N) 

Factorization 

Forward  & Back 
Substitution 

15 

8.22/1.30 

4.80/2.02 

31 

13.5/1.79 

7.65/2.75 

63 

21.1/2.45 

11.8/3.71 

127 

30.3/3.55 

17.3/5.18 

255 

38.6/5.69 

23.5/7.76 

511 

45.2/9.85 

28.7/12.8 

1023 

49.7/18.1 

32.7/22.5 

204"7 

52.4/34.3 

35.6/41.3 

4095 

53.9/66.9 

37.6/78.4 

Timings  of  a cyclic  reduction  of  a single  tridiagonal  system; 
results  given  as  [exectuion  rate  (MFLOPS) ]/ [time  (kiloclocks) ] 
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* 

Restrictions: 

(1)  No  premature  termination  of  reduction  [8] 

(2)  Single  precision  (10  digit)  reciprocation  of  pivots. 

Comments 

The  coding  has  been  optimized  for  the  large  matrix  case. 

The  execution  is  then  easily  shown  to  be  memory-path  bound. 

The  algorithm  was  therefore  chosen  to  make  a minimum  number 
of  memory  accesses  per  loop,  utilizing  shifting  instead  to 
align  operand  vectors  in  the  vectors  registers.  The  resultant 
coding  was  then  optimized  to  achieve  full  memory-path  utiliza- 
tion. Simulation  shows  that  94,  87,  86%  utilization  is  achieved 
for  the  factorization,  forward,  and  back  substitutions,  respect- 
ively, as  N -*•  <*>.  Simulation  also  shows  that  as  N + «,  approxi- 
mately 2/3  of  the  execution  time  is  needed  during  factorization 
for  this  CAL  version  vis-a-vis  a Fortran  version  written  in- 
dependently.^ 


*The  present  version  of  the  code  requires  N = 2r-l,r  a positive  integer 
t By  Dave  Kersnaw,  Lawrence  Livermore  Laboratory;  full  precision 
reciprocation  was  used  in  this  version. 
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